14 research outputs found

    Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers

    Full text link
    The massive amounts of digitized historical documents acquired over the last decades naturally lend themselves to automatic processing and exploration. Research efforts to automatically process facsimiles and extract information from them are multiplying, with document layout analysis as a first essential step. While the identification and categorization of segments of interest in document images have seen significant progress in recent years thanks to deep learning techniques, many challenges remain, among them the use of finer-grained segmentation typologies and the handling of complex, heterogeneous documents such as historical newspapers. Moreover, most approaches consider visual features only, ignoring the textual signal. In this context, we introduce a multimodal approach for the semantic segmentation of historical newspapers that combines visual and textual features. Based on a series of experiments on diachronic Swiss and Luxembourgish newspapers, we investigate, among other things, the predictive power of visual and textual features and their capacity to generalize across time and sources. Results show a consistent improvement of multimodal models over a strong visual baseline, as well as better robustness to high material variance.
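
    The abstract describes combining visual and textual features for segmentation. A minimal sketch of one common way to realize such a combination (hypothetical shapes and names, not the authors' implementation): a per-pixel text-embedding map, zero where no text is printed, concatenated channel-wise with the image before it is fed to a segmentation network.

```python
import numpy as np

# Hypothetical dimensions: a page scan (H, W, 3) and a per-pixel
# text-embedding map (H, W, EMB), where each pixel covered by a token
# carries that token's embedding and all other pixels are zero.
H, W, EMB = 128, 96, 8
image = np.random.rand(H, W, 3).astype(np.float32)
embedding_map = np.zeros((H, W, EMB), dtype=np.float32)

# Channel-wise fusion: the segmentation network then sees 3 + EMB input
# channels instead of 3.
fused = np.concatenate([image, embedding_map], axis=-1)  # shape (H, W, 11)
```

    The design choice here is early fusion: textual evidence is injected at the input, so every convolutional layer can relate it to local visual context.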

    Une approche computationnelle du cadastre napoléonien de Venise

    Get PDF
    At the beginning of the 19th century, the Napoleonic administration imposed on the city of Venice a new standardised description system intended to give an objective account of the form and functions of the urban fabric. The cadastre, deployed on a European scale, offered for the first time an articulated and precise view of the structure of the city and its activities, through a methodical approach and standardised categories. Digital techniques, based in particular on deep learning, now make it possible to extract from these documents an accurate and dense representation of the city and its inhabitants. By systematically checking the consistency of the extracted information, these techniques also evaluate the precision and systematicity of the work of the Empire's surveyors, and thus indirectly qualify the trust to be placed in the extracted information. This article reviews the history of this computational proto-system and describes how digital techniques offer not only systematic documentation, but also prospects for extracting latent information, not yet made explicit but implicitly present in this information system of the past.

    Historical newspaper semantic segmentation using visual and textual features

    No full text
    Mass digitization and the opening of digital libraries have given access to a huge amount of historical newspapers. In order to bring structure into these documents, current techniques generally proceed in two distinct steps: first, they segment the digitized images into generic articles, then they classify the text of the articles into finer-grained categories. Unfortunately, by losing the link between layout and text, these two steps cannot account for the fact that newspaper content items have distinctive visual features. This project proposes two main novelties. Firstly, it introduces the idea of merging the segmentation and classification steps, resulting in a fine-grained semantic segmentation of newspaper images. Secondly, it proposes to use textual features, in the form of embedding maps, at the segmentation step. The semantic segmentation with four categories (feuilleton, weather forecast, obituary, and stock exchange table) is done using a fully convolutional neural network and reaches a mIoU of 79.3%. The introduction of embedding maps improves overall performance by 3%, and generalization across time and newspapers by 8% and 12%, respectively. This shows a strong potential to consider the semantic aspect in the segmentation of newspapers and to use textual features to improve generalization.
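
    The abstract evaluates segmentation quality with mIoU (mean intersection-over-union). As a reference for the metric only (this is a standard definition, not the authors' code), a minimal NumPy sketch: per-class IoU is intersection over union of the predicted and ground-truth masks, averaged over the classes present.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union of two integer label maps.

    Classes absent from both maps are skipped so they do not
    artificially inflate or deflate the mean.
    """
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:  # class appears in neither map: skip
            continue
        inter = np.logical_and(p, t).sum()
        ious.append(inter / union)
    return float(np.mean(ious))
```

    For example, with `pred = [[0, 0], [1, 1]]` and `target = [[0, 1], [1, 1]]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, giving a mIoU of 7/12.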

    Language Resources for Historical Newspapers: the Impresso Collection

    Full text link
    Following decades of massive digitization, an unprecedented amount of historical document facsimiles can now be retrieved and accessed via cultural heritage online portals. If this represents a huge step forward in terms of preservation and accessibility, the next fundamental challenge, and the real promise of digitization, is to exploit the contents of these digital assets, and therefore to adapt and develop appropriate language technologies to search and retrieve information from this 'Big Data of the Past'. Yet the application of text processing tools to historical documents in general, and historical newspapers in particular, poses new challenges and crucially requires appropriate language resources. In this context, this paper presents a collection of historical newspaper data sets composed of text and image resources, curated and published within the context of the 'impresso - Media Monitoring of the Past' project. With corpora, benchmarks, semantic annotations and language models in French, German and Luxembourgish covering ca. 200 years, the objective of the impresso resource collection is to contribute to historical language resources, and thereby to strengthen the robustness of approaches to non-standard inputs and foster efficient processing of historical documents.

    Datasets and Models for Historical Newspaper Article Segmentation

    No full text
    Datasets and models used and produced in the work described in the paper "Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers": https://infoscience.epfl.ch/record/282863?ln=e

    Repopulating Paris: massive extraction of 4 Million addresses from city directories between 1839 and 1922

    No full text
    In 1839, in Paris, the Maison Didot bought the Bottin company. SĂ©bastien Bottin, trained as a statistician, was the initiator of a high-impact yearly publication called "Almanachs", containing listings of residents, businesses and institutions, arranged geographically, alphabetically and by activity typologies (Fig. 1). These regular publications met with great success. In 1820, the Parisian Bottin Almanach already contained more than 50,000 addresses, and until the end of the 20th century the word "Bottin" remained the colloquial term for a city directory in France. The publication of the "Didot-Bottin" continued at an annual rhythm, mapping the evolution of the active population of Paris and other cities in France. The relevance of automatically mining city directories for historical reconstruction has already been argued by several authors (e.g. Osborne, N., Hamilton, G. and Macdonald, S. 2014, or Berenbaum, D. et al. 2016). This article reports on the extraction and analysis of the data contained in the "Didot-Bottin" covering the period 1839-1922 for Paris, digitized by the BibliothĂšque nationale de France. We process more than 27,500 pages to create a database of 4.2 million entries linking addresses, person mentions and activities.

    Repopulating Paris: massive extraction of 4 Million addresses from city directories between 1839 and 1922.

    No full text
    Abstract of paper 0878 presented at the Digital Humanities Conference 2019 (DH2019), Utrecht, the Netherlands, 9-12 July 2019.
